Statistical Phylogenetic Tree Analysis Using Differences of Means 
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Abstract 

We propose a statistical method to test whether two phylogenetic trees with given alignments are sig- 
nificantly incongruent. Our method compares the two distributions of phylogenetic trees given by the input 
alignments, instead of comparing point estimations of trees. This statistical approach can be applied to gene 
tree analysis for example, detecting unusual events in genome evolution such as horizontal gene transfer and 
reshuffling. Our method uses difference of means to compare two distributions of trees, after embedding 
trees in a vector space. Bootstrapping alignment columns can then be applied to obtain p-values. To compute 
distances between means, we employ a "kernel trick" which speeds up distance calculations when trees are 
embedded in a high-dimensional feature space, e.g. splits or quartets feature space. In this pilot study, first 
we test our statistical method's ability to distinguish between sets of gene trees generated under coalescence 
models with species trees of varying dissimilarity. We follow our simulation results with applications to var- 
^ ious data sets of gophers and lice, grasses and their endophytes, and different fungal genes from the same 

t-h genome. A companion toolkit, Phylotree, is provided to facilitate computational experiments. 

ET 

1 Introduction 

6 

•p| Estimating differences between phylogenetic trees is one of the fundamental questions in computational biology. 

i Conflicting phytogenies arise when, for example, different phylogenetic reconstruction methods are applied to 
i i the, same data set, or even with one reconstruction method applied to multiple different genes. Gene phylo- 
^ genies may be codivergent by virtue of congruence (identical trees) or insignificant incongruence. Otherwise, 
they may be significantly incongruent lfT6l . All of these outcomes are fundamentally interesting. Congruence 
of gene trees (or subtrees) is often considered the most desirable outcome of phylogenetic analysis, because 
such a result indicates that all sequences in the clade are orthologs (homologs derived from the same ancestral 
CN sequence without a history of gene duplication or lateral transfer), and that discrete monophyletic clades can be 
unambiguously identified, perhaps supporting novel or previously described taxa. In contrast, gene trees that 
are incongruent are often considered problematic because the precise resolution of speciation events seems to 
be obscured. Thus, it would also be very useful to identify significant incongruencies in gene trees because 
> these represent noncanonical evolutionary processes, (e.g., |Il[331[T5l[T7]|). In this paper we propose a statistical 
hypothesis test which tells whether two phylogenetic trees are significantly incongruent to each other by com- 
5-i paring two distributions for phylogenetic trees, instead of comparing two point estimations. More specifically 
^ we will compare two distributions of trees using difference of means. In this paper we estimate these distribution 
by Bayesian sampling from the posterior distribution. Our statistical hypotheses are: 

H : Phylogenetic trees 7\ and T 2 are congruent. 
Hi'. Phylogenetic trees 7\ and T 2 are incongruent. 
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Usually a statistical test on the above hypotheses considers point estimates of the trees obtained by a tree recon- 
struction method, such as maximum likelihood estimates BSUl or the neighbor-joining method [|20l . See [|2T| 
and references within for an overview. Variation of reasonable tree estimates can be assessed, for example, by 
using the bootstrap or jackknife method. 

There are several techniques to test if gene trees are codiverged. For example, the Bayesian estimation 
methods (e.g., [HHUO), the Templeton test implemented in paup* 11241 (e.g., (HI), the partition-homogeneity 
test (PHT) also implemented with paup* (e.g., [27]), Kishino-Hasegawa test (e.g., |fT8l ), and the likelihood 
ratio test (LRT; e.g., [26]) are statistical methods to see if there is a "significant" level of incongruence between 
the trees (these methods are also called partition likelihood support (PLS) lfT3ll ). However, there is a limitation 
in many methods for comparing two phylogenetic trees: It is implicitly assumed that the two given trees are 
actually correctly estimated phylogenies. In reality, trees are estimated from observed data (e.g. fossil record, 
sequence data), and tree uncertainty is the rule instead of the exception. Thus, to estimate trees, we propose 
using posterior means instead of maximum likelihood, and we apply the bootstrap method to assess variation in 
the posterior means. Our method could also be applied with tree estimators like maximum likelihood, instead of 
the posterior mean. 

This paper is organized as follows: In Section 2, we state our method. In Section 3, we show simulation 
studies with data generated by the software Mesquite [Maddison Knowles 2006]. In Section 4, we apply our 
method to well-known gopher-louse data sets from |[T0l and grass-endophyte data sets from (|2T|. We end with 
a discussion. 

2 Materials and Methods 
2.1 Preliminaries 

Let T n be the space of trees on n taxa. When analyzing and comparing phylogenies, often tree features are used. 
The notion of tree features can be expressed formally as a vector space embedding: 

Definition 1. Given a vector space embedding v : T n — > M. m for some m, the vector v (T) is the feature vector 
ofT. 

The difference between trees Xi,T 2 can be quantified as the distance \ \v (T\) — t>(T 2 )||, where || • || is any 
norm. In this paper we will focus on L 2 norms. 

A notable example of our framework is the dissimilarity map distance. 

Definition 2. For T G T n , let v(T) = (<i^ 2 , d% 3 , . . . , d^_ l n ) G be the vector of pairwise distances dfj 
between leaves i and j in T. The dissimilarity map distance is 

d(T u T 2 ) = IKT0 - <T 2 )|| = ^(dl 2 -dllf + ... + {dlU, n -d T n U,n) 2 
where || • || represents L 2 norm (Euclidean length). 

In our computational experiments, we will use the dissimilarity map distance. Dissimilarity map distance 
was studied in 0. One can also consider a variation where all edge lengths are set to 1. The arising dissimilarity 
distance is called the path difference and only depends on tree topologies. 
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2.2 Testing for congruence of two trees 



In our framework, given are D x , D 2 , each a collection of n aligned homologous sequences. We assume D x , D 2 
were generated by models of sequence evolution on unknown trees Ti, T 2 . After embedding trees into a vector 
space, our statistical hypotheses are: 

H : \\v(T 1 )-v(T 2 )\\ = 0, 
H±: \\v(T 1 )-v(T 2 )\\>0. 

For convenience, we describe our approach as comparing two gene trees Ti, T 2 from the same set of species. 
One can also compare a phylogeny for host species and a phylogeny for corresponding parasites, as we do in 
section [3^2] 

Random fluctuations in sequence evolution can cause reconstructed gene trees for Di and D 2 to look at least 
slightly different, even if the true underlying trees are equal. Thus we need a way to tell if the difference between 
two estimated trees is "significant." 

One classical approach to assess variability in reconstructed trees is the bootstrap (61. The bootstrap gen- 
erates new hypothetical sequence alignments, by sampling (with replacement) columns of aligned sequence. 
Then trees can be re-estimated for each hypothetical alignment. One common application of the bootstrap is to 
measure support for each clade; clades that appear in most bootstrap replicate trees are regarded as likely clades 
in the true tree. 

Here we propose a bootstrap procedure to assess significance of the distance between two trees. Our method 
is based on the triangle inequality. Namely, if v (Ti), v(T 2 ) are estimators for f(Ti), v (T 2 ), then the triangle 
inequality says 

HTi) - W (T 2 )|| > HTO - «(T 2 )|| - |KTi) - wCfOH - \\v(T 2 ) - v(f 2 )\\, 

which gives a lower bound on the distance between the true trees Ti, T 2 . See Figure 1 for an illustration. We 
cannot compute the right-hand side of the inequality directly, because Ti, T 2 are unknown. Instead, we use the 
bootstrap to estimate the distributions of the terms H'u(Ti) — v (Ti)|| and ||f(T 2 ) — w(T 2 )||. An outline of our 
bootstrap procedure is in the Supplement. 
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Figure 1: A diagram showing two cases of the differences of means method. In Subfigure 1(a) the distance di 
between v(Ti) and v(T 2 ) is greater than the dist ance d 2 between v(Ti) and v(Tx) plus the distance <i 3 between 
v (T 2 ) and v(T 2 ), i.e., d\> d 2 + d 3 . In Subfigure 1(b) we see d% < d 2 + d 3 . 
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2.3 Difference of means 



The bootstrap procedure we have proposed can be applied with any tree estimator, such as neighbor joining 
or maximum likelihood. Since we are presuming tree uncertainty is high, and Bayes estimator trees are more 
accurate than neighbor joining or ML |[T2"1|. we prefer a Bayes estimator approach. 

Given an alignment D, generated by sequence evolution on an unknown tree T, Bayesian MCMC sampling 
methods will approximately sample from the posterior distribution P(T \ D) ~ P(D | T)P(T) ||29l . For two 
posterior distributions P(Ti | Di) and P(T 2 \ D 2 ), let t\, . . . , be a sample from P(T\ \ Di), and similarly for 
t[, . . . , t' N2 a sample from P(T 2 \ D 2 ). Then we can use v (pi) as an estimator for v (Ti), and similarly 

for v (T 2 ). The difference of means is 



2.3.1 The kernel trick for estimating 1 1 A 1 1 

Some feature space embeddings produce very high-dimensional feature vectors v (Ti), v(T 2 ), yet the distance 
||f (Ti) — v (T 2 )|| can be computed quickly without explicitly writing down the feature vectors for T\ and T 2 . 
Notable examples include Robinson-Foulds distance and quartet distance. In such cases, it would be desirable 
if the difference of means ||A|| could be estimated, by sampling trees and computing the distances between 
samples (without writing down any feature vectors). This is indeed possible, using a kernel trick: 

Proposition 1. Let Xi,x 2 ,yi,y 2 G M. m be four pairwise independent random variables, where X\ and x 2 are 
drawn according to distribution P, and y%, y 2 are drawn according to distribution Q such that E(xx) = E(x 2 ) = 
fj, x G W m andE( yi ) = E(y 2 ) = fi y G R m . Then 

\\f, x - f, y \\ 2 = E(\\ Xl - yi \\ 2 ) - l - [E(||x! - x 2 || 2 )] - i [E{\\ Vl -y 2 \\ 2 )] . (2) 

A proof of Proposition [T] is provided in the supplement. Using the proposition, and a subroutine which can 
compute \\v(t) —v(t')\\ for two given tree samples t, t', the length ||A|| = ||Eu(Ti) — Ew(T 2 )|| can be estimated 
from the samples {ti}, {^}. 

3 Results 

3.1 Simulations 

In this section we estimate posterior distributions of phylogenetic trees via MCMC-based software MrBayes 
[fTTTl and apply the difference of means method to test if two phylogenetic trees are incongruent. Note other soft- 
wares such as BEAST [3] could also be used. Simulated data sets were generating using the software Mesquite 
ifTTll with parameters chosen similar to [17], to emulate real data and test the effectiveness of our method. 
Mesquite takes two parameters; the species depth in terms of number of generations and the population size 
in terms of number of individuals. Three simulation sets were generated, determined by the species depths of 
100000, 600000, and 1000000. The effective population size was fixed to 100000 for all data sets. For each 
simulation set, two species trees, species tree one and two, with eight species were generated using the pure 
birth yule process in Mesquite. Sequence alignments were generated by Mesquite under HKY85 model 



1 1=1 

and I IAN is an estimator for ||i> (Ti) — t>(T 2 )||. 
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with transition-transversion ratio of 3.0, a discrete gamma distribution with four categories and shape parameters 
0.8. In all our simulations, we set the stationary probability distribution n = (0.3, 0.2, 0.2, 0.3) for A, C, G, T 
respectively, the 3:2 AT:GC ratio was maintained through all trees, and our sequences were generated with 1000 
base pairs. The coalescence gene trees generated had branch lengths in terms of the coalescence model and 
therefore a scaling factor of 10~ 8 was used to yield sequences with sequence divergence similar to real data. 
Table [TJ shows sequence divergences. The sequence divergence was calculated in two ways: (1) the average 
percent pairwise difference between all sequences iTTVTl . and (2) the minimum of the pairwise percent differences 
among sequences [|9]|. 



Species 
Depth 


Min 
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Median 


Q3 


Max 


Species 
Depth 


Min 


Qi 


Median 


Q3 


Max 


1000K 


0.000 


0.002 


0.005 


0.008 


0.017 


1000K SD 


0.032 


0.04 


0.043 


0.045 


0.054 


600K 


0.000 


0.003 


0.006 


0.01 


0.022 


600K SD 


0.025 


0.03 


0.032 


0.035 


0.046 


100K 


0.000 


0.001 


0.001 


0.002 


0.006 


100K SD 


0.004 


0.007 


0.008 


0.012 


0.016 



(a) Pairwise Minimum 



(b) Pairwise Average 



Table 1: QI means the first quantile and Q3 means the third quantile. By "min" we mean the smallest number 
and "max" means the largest number among a sample. Sequence divergences were calculated in two ways: 1) 
the pairwise minimum percentage of sequence divergence and 2) the average pairwise percentage of sequence 
divergence. 



Spieces 1 



Spieces 2 





Spieces 1 



Spieces 2 





Spieces 1 Spieces 2 

SD 1000K 





FDABGECH DCHBAFEG 



CDFBAHEG GEFHCABD BHAFCDGE ABHCGFED 



Figure 2: The three pairs of species trees used in our simulations. The dissimilarity maps normalized by y (2) 
between the two species trees used to generate gene trees for our simulations are 0.4333 for 1, 000, 000 species 
depth, 0.2672 for 600, 000 species depth and 0.046 for 100, 000 species depth. 

In order to estimate posterior distributions we used the MCMC -based software MrBayes with the following 
parameters: (1) for the model: HKY85 + Gamma, shape parameter: 0.8, transition-transversion ratio: 3.0; and 
(2) for MCMC runs: number of runs: 1, number of chains: 2, chain length: 100, 000, sample frequency: 1, 000, 
burn-in: 25%. For bootstrap sampling we sampled 100 bootstrap samples with sample size of 1, 000 columns 
since the simulated sequences are generated with 1, 000 base pairs. 

We generated simulated data sets in three different ways; (1) two separate sequence data sets generated 
from the same gene tree, (2) sequence data sets generated from two different gene trees under the same species 
tree, (3) sequence data sets generated by two sequence data sets generated from two different gene trees whose 
species trees are also different. We tested ten gene trees for each species depth (i.e. 30 different gene trees in 
total) generated under the same species tree. One can find the species trees we used in Figure [2j We used two 
sets of sequences generated under the HKY model with the same tree for each test. We have the three species 
depths of 1000,000, 600,000 and 100,000, with fixed population size of 100,000. Notice that we do not observe 
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1000KSD 600KSD 100KSD 1000KSD 600KSD 100KSD 1000KSD 600KSD 

Species depth 



(a) Box plots of p-values for simulated data. We generated ten gene trees for each species depth (i.e. 
30 different gene trees in total) for each species tree. Sequences for each gene tree were generated 
using the HKY model. Species depths of 1, 000, 000, 600, 000 and 100, 000 were used, with fixed 
population size of 100, 000. White box plots are for testing Type I error by comparing identical gene 
trees generated by Mesquite (we used the species tree on the right in Figure[2]), light grey box plots 
represent the p-values with two different gene trees generated from the same species tree under the 
coalescence model (we used the species tree on right in Figure[2]), and dark grey box plots summarize 
the p-values with gene trees generated from two different species trees. Some boxes of p-values are 
identically zero. Box plots were computed in R ll25l using boxplot 
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(b) These plots represent correlation between distances among true trees and p-values, and correla- 
tion between difference of means and p-values. These are computed with the same data sets in Figure 



3(a) The sample size here is 94 (i.e., there are 94 pairs of sequence data sets we tested in total). For 
more details see supplement. We fitted the data in R ll25ll using loess to perform local regression. 
The dotted lines are for 95% confidence intervals of the fitted lines. The vertical solid line is the line 
x = 0.05 which represents the a-level, p value = 0.05. 



Figure 3: 



any Type I errors with our testing method, however, in within-species comparisons at species depth of 1,000,000 



the p-values were high in general (Figure 3(a) ). Also notice that with pairs of gene trees where each pair of 
gene trees are generated from different species trees under the coalescence model, the p-values were 0.000 for 
all pairs of genes from 1,000,000 and 600,000 species depth. However, in the case of species depth 100,000 we 
see that only one pair (Species 1 _GenetreeO / Species2_Genetree7) has a p-value less than 0.05 (see Table [5] in 
the supplement). We appear to have Type II errors, probably because species trees with 100,000 species depth 
are very close to each other. 

p-values and distance between true trees appear strongly correlated. We fitted correlations between p-values 
and distance between true trees as well as correlation between p-values and the difference of means for the 



posterior distributions given the original sequence data sets, using a function called loess (Figure 3(b)). The 
fitted lines show negative correlation between the p-values and the distance between true trees and also negative 
correlation between the p-values and the difference of means. Note that the fitted lines for distances between 



true trees and for differences of means in Figure 3(b) any p-values below the a-level (0.05 in our case) are within 
their confidence intervals. Actually they are within their confidence intervals up to the p-value equals to 0.3. 
This means the differences of means with posterior distributions given the original sequence data sets are good 
measurements for distance between true trees for our statistical tests. This is particularly important since we 
usually do not know the true trees with biological data sets. For complete results of our simulations see Table [4] 
and Table [5] in the supplement. 

3.2 Experiments with real data sets 



We tested our method with a well-known gopher- louse data set IflOI . see Table 2(a) This data set contains 
17 taxa of lice and 15 taxa of gophers. In order to satisfy the requirement for an equal number of leaves for 
tree comparison we constructed 4 individual data sets reflecting all possible pairings of the two gopher species 
involved in the possible host jumps with their apparent parasitic louse species: (dataset 1) Thomomys talpoides- 
Thomomydoecus barbarae, Thomomys bottae-Thomomydoecus minor; (dataset 2) Thomomys talpoides-Geomydoecus 
thomomyus, Thomomys bottae-Thomomydoecus minor; (dataset 3) Thomomys talpoides-Thomomydoecus bar- 
barae, Thomomys bottae-Geomydoecus actuosi; (dataset 4) Thomomys talpoides-Geomydoecus thomomyus, 
Thomomys bottae-Geomydoecus actuosi. 

The posterior distributions were estimated using MrBayes with the following parameters: (1) for the model: 
GTR + Gamma + Invariant sites; (2) for MCMC: number of runs: 1, number of chains: 2, chain length: 100, 000, 
sample frequency: 1, 000, burn-in: 25%; and (3) for bootstrap sampling: 100 bootstrap samples with sample size 
of 379 columns which is the length of sequence alignments in the data sets. 

We also tested our Method with the data sets from iBTTl . After removing cases of apparent host jumps, the 
data sets contain sequences from 20 taxa of grasses and 20 taxa of endophytes. Sequences were aligned with 
the aid of PILEUP implemented in SEQWeb Version 1.1 with Wisconsin Package Version 10 (Genetics 
Computer Group, Madison, Wisconsin). PILEUP parameters were adjusted empirically; a gap penalty of two 
and a gap extension penalty of zero resulted in reasonable alignment of intron-exon junctions and intron regions 
of endophyte sequences, and of intergenic spacer and intron regions of cpDNA sequences. Alignments were 
scrutinized and adjusted by eye, using tRNA or protein coding regions as anchor points. For phylogenetic analy- 
sis of the symbionts, sequences from tubB (encoding /3-tubulin) and tefA (encoding translation elongation factor 
1-a) were concatenated to create a single, contiguous sequence of approximately 1400 bp for each endophyte, 
of which 357 bp was exon sequence and the remainder was intron sequence. For phylogenetic analysis of the 
hosts, sequences for both cpDNA intergenic regions (trnT-trnL and trnL-trnF) and the trnL intron were aligned 
individually then concatenated to give a combined alignment of approximately 2200 bp. Analysis was also 
performed using the sequences from tubB and tefA separately. 
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Gopher-louse (dataset 1) 0.640 



Data set p-value 



, , _ Grass -endophyte tefA 0.040 

Gopher-louse (dataset 2) 0.400 „ , , , „ 

_ , . , , „ „ . „ Gras s -endophyte tubB 0.080 

Gopher-louse (dataset 3) 0.840 . , . „ , 

Gnnher Innse CHataset 4^ 5Q0 Grass-endophyte tubB plus tefA 0.000 

uopner-iouse (uaiasei ti u.jyu n % , r ji^j^ t r H mi 

, * i _ v , ' _ . „ , (b) p-values for grass-endophyte data sets from [21 1, 
(a) p-values for subsets of the well-known . . . , . , 

5, . . .n-— . „ , After removing cases of apparent host lumps, the data 

gopher-louse data set in 10 . All p-values are . _ ° . , ' - 

comprises 20 taxa of grasses and 20 taxa of endo- 



high, so no significant incongruence is found. 



phytes. The first two rows compare grass phylogeny 
to gene trees for tefA and tubB in endophytes; the last 
row uses the concatenation of tefA and tubB. 

Table 2: 



tolC 



-14 E. brachyetylri 
16 E. gfyctriat 



L 



0.005 



jV. itni-inalitni ivlC\ 
1 2 N. occulta*!.* 

1 1 N. sirgtRi 

N. sp. (F . at.) laid 

.3 E. tinuirfflijni 

JV. <:himsum toiCI 
S. baoonfl 
15 E-feititrae 

f-l N. coenojihitihmr 
7 N. wtdmilwt ioiCl 
—2 N. sp. (P. au.) 
— 17 N. sp. {L. nr.) 
N. ip.(F.u!-)tolC2 
N. atilvur-fMie 




Figure 4: Trees with maximum likelihood identified by MCMC search on aligned intron sequences from the lolC, tefA, 
and tubB genes of Epichloe and Neotyphodium species. Some Neotyphodium species are interspecific hybrids that have 
multiple genomes from different ancestors. The genes from the different genomes are distinguished, for example, as lolCl 
and lolCl. The same number labeling leaves on the three trees indicates genes from the same genome from the same fungal 
isolate. 
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The posterior distributions were estimated using MrBayes with the following parameters: (1) for the model: 
GTR + Gamma + Invariant sites; (2) for MCMC: number of runs: 1, number of chains: 2, chain length: 100, 000, 
sample frequency: 1, 000, burn-in: 25%; and (3) for bootstrap sampling: 100 bootstrap samples, number of 
bootstrap columns equals length of original alignment. 

These results are interesting in comparison with the prior finding of significant relationship between the 
phylogenies of the grasses and their endophytes lETTl . The previous analysis indicated a significant relationship 
between ages of corresponding nodes in endophyte and grass phylogenies, addressing whether divergences 
of grass and endophyte clades tended to occur at approximately the same time. In contrast, results of the 



analysis above suggest that the grass and endophyte phylogenies are significantly different (Table 2(b)). We 
conclude that such a relationship of node ages does not necessarily imply similar phylogenetic histories. This is 
reasonable because the relationships of grasses and their endophytes is expected to be one of diffuse cospeciation 
at best. Individual species of endophyte may be associated with genera or tribes of grasses, but rarely with 
individual species. This contrasts with the gopher-gopher louse situation, where evidence suggests a much 



stricter coevolutionary relationship (Table 2(a) ). 

We chose an additional biological data set to compare phylogenies of genes that occur together in endophyte 
genomes. Whereas tefA and tubB are housekeeping genes present in all isolates, lolC is a secondary metabolism 
gene sporadically present in endophyte isolates (|22|. It has been suggested that such sporadically occurring 
secondary metabolism genes may be distributed in fungi largely by horizontal gene transfer ||2~8~1 . To investigate 
this possibility in the case of lolC, we used our approach to test whether the phylogenies of these three genes were 
significantly different. The most likely trees obtained by MCMC showed related but nonidentical topologies 
(Figure [4| note placement of genes from Epichloe festucae and Epichloe brachyelytri). Our test found no 
significant difference between the phylogenies, although the p- values appear stochastically smaller than the 
p-values observed for simulated data under the null. This perhaps reflects the conservative nature of our test. 
Removing either Epichloe festucae or Epichloe brachyelytri) altered the results only slightly (Table 3(b)[ ). These 
results indicate that lolC evolution was largely or exclusively by decent, and disfavored horizontal transfer as an 
explanation for the sporadic distribution of this gene. 



Data set 



p-value 



Data set 



lolC vs. tefA 0.390 

lolC vs. tubB 0.560 

tefA vs. tubB 0.940 
(a) The results with our statistical method 
with the endophyte data sets from lolC, 
tubB, tefA genes. There are 17 taxa in each 
data set. 



p-value 



lolCvs.tefA 0.230 

lolCws.tubB 0.340 

tefA vs. tubB 0.870 
(b) The results with our statistical method 
with the endophyte data sets from lolC, 
tubB, tefA genes after removing E2368. 
There are 16 taxa in each data set. 



Table 3: 



3.3 Toolkit for Computational Experiments 

To facilitate computations for our experiments, we developed a set of programs, collectively called Phy lot ree. 
Phylotree is organized as a collection of scripts for running a complete computational experiment starting 
from sequence alignments, then sampling phylogenetic trees and computing distances between phylogenetic 
trees and their distributions (see Section [2]). Supported distance measures include path-difference, dissimilarity 
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map distance, Robinson-Foulds distance. Available scripts allow for selecting the number of columns and the 
number of bootstrap samples, linking taxa in the alignments and provide flexibility for using different sampling 
methods (e.g., MrBayes or BEAST) and distance measures. This is free software, and will be distributed 
under the terms of the GNU General Public License. One can download the software at|http ://csurs7. 



csr . uky . edu/phylotree/ The website is password protected and login information can be obtained at 



http : / /cophylogeny . net /research . php| 



4 Discussion 



In this paper we presented a method to determine if two phylogenetic trees with given alignments are signifi- 
cantly incongruent. Our method computes the difference of means of posterior distributions of trees, which has 
the advantage of using entire tree distributions, as opposed to single tree estimators. 

In this paper we used the triangle inequality (d\ < d 2 + d 3 in Figure 1) to derive a bootstrap procedure 
to compute p-values. However, our bootstrap procedure appears to be very conservative, producing p-values 
whose null distribution is stochastically much larger than uniform {7(0, 1). Thus in order to increase the power 
we might want to consider different criteria for computing p-values. One approach may be to define v(Ti), v(T 2 ) 
to be the average of bootstraps {v(T*)}, {v(T 2 *)}, rather than the initial tree estimates. Another possibility is to 
replace the triangle inequality with a max condition (e.g. in Figure 1 use the condition d\ < max(d 2 , ofa)). We 
explored this in the supplementary material, and it seems that the max condition provides much more power, but 
is somewhat anti-conservative. 

In this paper we used the dissimilarity map as feature space. However, there other common tree features 
which can be used to define different feature spaces. Examples of distances derived from tree features include 
(normalized) Robinson-Foulds distance [fP9ll : quartet distance [5J; and the path difference metric 11231 . Of course, 
in all the above examples, we could choose any vector space norm, such as L p for any p. The important point 
is that there are many different useful features (i.e., choices of vector space embeddings) which can be used to 
analyze trees, and many such as splits and quartets have already been used for quite some time. Moreover, with 
the kernel trick presented above we can efficiently calculate distances between distributions of trees using the 
Robinson-Foulds and quartet distance. Thus it is interesting to use different feature spaces for our statistical 
method, and we leave this for future work. 
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7 Supplement Materials 



7.1 Bootstrap procedure 

Here is the outline of our bootstrap procedure we have used in the main manuscript. 

Algorithm 1. Input; Alignments D 1 and D 2 

Output; An estimated p-value, under the null H : \ \v(Ti) — v(T 2 ) \ \ = 

1. Compute tree estimates v{fi),v{f 2 ) from the data Di, D 2 . 

2. Compute d = \\v(Ti) — v(T 2 )\\. 

3. Let p = and for i = 1 to N do 

• Take a bootstrap sample DI from the columns of alignment D\ and a bootstrap sample D 2 * from the 
columns of D' . 

• Compute tree estimates v(T\*), v(T 2 *) from the data DI, D 2 *. 

• Test the condition: if d < \\v(Ti) — v(Ti*)\ \ + \ \v(T 2 ) — v(T 2 *)\ \ then set p = p + 1. 

4. Set p = jr and return p as the p-value of the hypothesis test. 

During the "Test the condition" step, one could alternatively use the less conservative condition d < max(| \v(Ti) — 
«(r a *)||,|Kf a )-t;(T 2 *)||) 

7.2 Proof of Proposition [1] 

Proof. For any a e IR m , let a T be the transpose of a. Since x 1 ,x 2 ,yi,y 2 G IR m are mutually independent, the 
bilinearity of dot-product gives E(xJ x 2 ) = E(xi) T E(x 2 ) and similarly E(yf y 2 ) = E(yi) T E(?/ 2 ) and K(xjyi) = 
E(xi) T E(yi). Also for any vectors a, b G M m , we have the identity a T b = b T a. 
Then we have 

K(||x! -2/i|| 2 ) 
= E((x 1 -y 1 ) T (x 1 -y l )) 
= E(xjx 1 + y[y x - x\y x - yjx^ 

= E(xf Xl ) + E{y? yi ) - E(xfyi) - E(yfx x ) 1 ; 

= E(xJ Xl ) + E(yl Vl ) - E(x 1 ) T E(y 1 ) - E(y 1 ) T E(x 1 ) 
= E(xTx 1 )+E(y{y 1 )-2nZn y . 
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But 



Thus we have 



E [(xi - x 2 + x 2 ) T (x 1 - x 2 + x 2 ) 
E[(xi - x 2 ) T (xi - x 2 ) + x^{xi - x 2 ) 
+ - x 2 ) T x 2 + x^x 2 ] 
E [(xi - x 2 ) T (xi - x 2 )] + E [xl(xi - x 2 )] 
+E [(xi - a; 2 ) T a; 2 ] + E(a£z 2 ) 
E [(xi - x 2 ) T (x l - x 2 )] + E(x 2 n x 1 ) - E(a£x 2 ) 
+E(xf x 2 ) - E(a£ar 2 ) + E(x^a; 2 ) 
E [(xi - x 2 ) T (xi - x 2 )] + E(4xi) - E(a£ar 2 ) 
+E(xfx 2 ) 

E [(xi - x 2 ) T (xi - x 2 )] + /j%fi x - E(x1x 2 ) 



E 
E 



(xi - x 2 ) T (xi - x 2 ) 
(x 1 -x 2 ) T {xi -x 2 ) 



+ 2fi^fi x - E(x2 x 2 ) 
+ 2fi^fi x - E(xjx 1 ) 



2E(x^xi) = E [(xi - x 2 ) T (xi - x 2 )] + 
By dividing the both sides of the equation in (|4]) by 2 we have: 



(4) 



Similarly we have 



E(xfxi) = -E [(zi - x 2 ) T (xi - x 2 )] + nln x . 



E{yT yi ) = ^E [{ Vl - y 2 ) T ( yi - y 2 )] + fi^ y . 



(5) 



(6) 



Then we substitute E(xfxi) and E(yjy 1 ) in equations Q and (|6]) into the equation in Q, we have 

E(||*i-2/i|| 2 ) 
= [|E [(a?! - x 2 ) T (xi - x 2 )] + fgnx] 
+ [|E - y 2 ) T (y 1 - y 2 )] + ^] 

= |E [(xi - x 2 ) T (x! - x 2 )l + |E - y 2 ) T ( yi - y 2 )] 

= |E [(x! - a^Hx! - x 2 )] + |E [( Vl - y 2 ) T ( yi - y 2 )} 
+ \\Vx + Vy\\ 2 - 



(7) 



□ 



7.3 Supplement for simulation studies 

Throughout the remaining subsections, Condition (i) refers to the condition tested in Algorithm[l] d < \ \v(Ti) — 
u(Ti*)||+||v(f 2 )-v(r2*)||. Condition (ii) refers to the alternative condition d < max(||w(fi)-w(Ti*)||, ||v(f 2 )- 
v(T 2 *)\\). 



7.4 Results with biological data sets 
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Data set 




Gopher- louse (dataset 1) 


0.640 


0.020 


0.120 


Gopher-louse (dataset 2) 


0.400 


0.020 


0.128 


Gopher-louse (dataset 3) 


0.840 


0.070 


0.117 


Gopher-louse (dataset 4) 


0.590 


0.030 


0.122 


Grass-endophyte (tefA gene) 


0.040 


0.000 


0.073 


Grass-endophyte (tubB gene) 


0.080 


0.010 


0.088 


Grass-endophyte (tefAJubB genes) 


0.000 


0.000 


0.073 



Table 6: The numbers in this table are p-values for our statistical hypothesis testing estimated by our method 
with the grass-endophyte data sets from ll2"Tll . dist represents the average of the normalized difference of means 
with posterior distributions given the original sequence data sets. 









rmuted 




Data set 






D 


'-a 


(- E2368) 










endophyte 16-taxa (lolC vs tubB) 


0.340 


0.000 


0.000 


0.100 


endophyte 16-taxa (lolC vs tefA) 


0.230 


0.000 


0.000 


0.105 


endophyte 16-taxa (tefA vs tubB) 


0.870 


0.050 


0.000 


0.080 


(-E1125) 










endophyte 16-taxa (lolC vs tubB) 


0.670 


0.010 




0.093 


endophyte 16-taxa (lolC vs tefA) 


0.870 


0.040 




0.073 


endophyte 16-taxa (tefA vs tubB) 


0.950 


0.160 




0.069 



Table 7: The numbers in this table are p-values for our statistical hypothesis testing estimated by our method 
with the endophyte data sets from tefA, tubB, and lolC genes, dist represents the average of the normalized 
difference of means with posterior distributions given the original sequence data sets. (- E2368) means that they 
are with the data sets after removing E2368 and (- El 125) means that these are the results with the data sets after 
removing El 125. 
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